Date created: 7.04.2020

In [1]:
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import warnings
warnings.filterwarnings('ignore')

Question: We use to hear the term “exponential growth” all the time. But the true data scientist must object - what is the typical growth model for virus pandemics?

Exponential growth is a mathematically convenient model for the early stages of an outbreak of an infectious disease. Existing methods for producing Epidemic forecasts often make an implicit assumption that growth is exponential, at least while susceptible depletion remains negligible. However, empirical studies suggest that many infectious disease outbreaks display sub-exponential or polynomical growth early in the epidemic.

To understand the CoVID-19 pandemic curve, let’s evaluate time series data of confirmed cases based on statistical approach.

First, let’s assume, that CoVID-19 Pandemic Curve has an Exponential growth.

What is Exponential growth? It is a specific way that a quantity increases over time.

The formula that illustrates the concept of exponential growth is as follows:

$$n_t = n_0*b^t$$

In case of modeling CoVID-19 Pandemic, it would show the number of Coronavirus cases at any given time $t$, $n_0$ is the number of cases at initial moment of time, $b$ is a growth factor, showing the number of people infected by each sick person.

If we model a situation where initially there is 1 infected person and each sick person infects $2$ other people ($b = 2$), we’ll end up having 1 billion infected people in just 1 month:

In [2]:
x = []
y = []
for i in range(31):
    x.append(i)
    y.append(2**i)
In [3]:
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Number of cases in 2 weeks", "Number of cases in a month"))


fig.add_scatter(x=x[:10], y=y[:10], line=dict(color='red', width=2), row=1, col=1)

fig.add_scatter(x=x, y=y, line=dict(color='blue', width=2), row=1, col=2)

fig.update_layout(title_text="Graph of Exponential Growth with Growth Factor 2",
                  title_font_size=20,
                 xaxis_title="Days",
                yaxis_title='''Number of Cases (Growth Factor 2)''')

fig.update(layout_showlegend=False)

#fig.show()

Let’s compare the shape of these graphs to the real curves of Total Coronavirus Cases and Deaths. The data used for the next plot were collected from European Centre for Disease Prevention and Control website https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide. It shows daily number of cases and deaths per country:

In [4]:
df = pd.read_excel('COVID-19-geographic-disbtribution-worldwide-2020-04-09.xlsx')
#world_stat.Date = pd.to_datetime(world_stat.Date)
df.head()
Out[4]:
dateRep day month year cases deaths countriesAndTerritories geoId countryterritoryCode popData2018
0 2020-04-09 9 4 2020 56 3 Afghanistan AF AFG 37172386.0
1 2020-04-08 8 4 2020 30 4 Afghanistan AF AFG 37172386.0
2 2020-04-07 7 4 2020 38 0 Afghanistan AF AFG 37172386.0
3 2020-04-06 6 4 2020 29 2 Afghanistan AF AFG 37172386.0
4 2020-04-05 5 4 2020 35 1 Afghanistan AF AFG 37172386.0

First of all, let's group the data by date:

In [5]:
total_cases = df.groupby("dateRep")[["cases","deaths"]].sum()
total_cases
Out[5]:
cases deaths
dateRep
2019-12-31 27 0
2020-01-01 0 0
2020-01-02 0 0
2020-01-03 17 0
2020-01-04 0 0
... ... ...
2020-04-05 86713 6115
2020-04-06 71232 4655
2020-04-07 71391 5137
2020-04-08 74902 7412
2020-04-09 84930 6339

101 rows × 2 columns

In [6]:
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Daily Cases", "Daily Deaths"))


fig.add_scatter(x=total_cases.index, y=total_cases['cases'], line=dict(color='blue', width=2), row=1, col=1)

fig.add_scatter(x=total_cases.index, y=total_cases['deaths'], line=dict(color='red', width=2), row=1, col=2)

fig.update_layout(title_text="Coronavirus Cases and Deaths Daily",
                                    title_font_size=20,
                 xaxis_title="Days",
                yaxis_title='''Number of Cases''')

fig.update(layout_showlegend=False)

fig.show()

Apply cumulative sum:

In [7]:
total_cum = total_cases.cumsum(axis = 0)

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Total Cases", "Total Deaths"))


fig.add_scatter(x=total_cum.index, y=total_cum['cases'], line=dict(color='blue', width=2), row=1, col=1)

fig.add_scatter(x=total_cum.index, y=total_cum['deaths'], line=dict(color='red', width=2), row=1, col=2)

fig.update_layout(title_text="Total Coronavirus Cases and Deaths",
                                    title_font_size=20,
                 xaxis_title="Days",
                yaxis_title='''Total Coronavirus Cases''')

fig.update(layout_showlegend=False)
#fig.update_xaxes(tickangle=90)

fig.show()

The curve of the Total Coronavirus Cases and Deaths has similar to exponential shape.

Let’s find the real growth factor of the Coronavirus pandemia.

The best method to find the growth factor from empirical daily observations is to use a statistical model called Linear Regression. Linear Regression allows to estimate the best values for $a$ and $w$ in the following formula, given empirical observations for $y$ and $t$:

$$y = a +w*t,$$

where $y = \log(n_t)$, $a = \log(n_0)$, $w = \log(b)$.

The last formula contains log of the number of cases instead of the number of infections and log of growth factor instead of growth factor.

Let's apply log transformation to total cases and deaths and then use the statsmodels library to estimate the Linear Regression function:

In [8]:
import numpy as np
total_cum['log_cases'] = np.log(total_cum['cases']+1)
total_cum['log_deaths'] = np.log(total_cum['deaths']+1)
In [9]:
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Total Cases (Logarithmic Scale)", "Total Deaths (Logarithmic Scale)"))


fig.add_scatter(x=total_cum.index, y=total_cum['log_cases'], line=dict(color='blue', width=2), row=1, col=1)

fig.add_scatter(x=total_cum.index, y=total_cum['log_deaths'], line=dict(color='red', width=2), row=1, col=2)

fig.update_layout(title_text="Coronavirus Cases and Deaths",
                                    title_font_size=20,
                 xaxis_title="Days",
                yaxis_title='''Total Coronavirus Cases''')

fig.update(layout_showlegend=False)
#fig.update_xaxes(tickangle=90)

fig.show()
In [10]:
d = total_cum.reset_index()

Linear Regression

In [11]:
import statsmodels.api as sm
X = d.index
X = sm.add_constant(X)

y = d.log_cases
mod = sm.OLS(y[60:], X[60:])
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_cases   R-squared:                       0.973
Model:                            OLS   Adj. R-squared:                  0.972
Method:                 Least Squares   F-statistic:                     1386.
Date:                Mon, 13 Apr 2020   Prob (F-statistic):           4.34e-32
Time:                        18:56:43   Log-Likelihood:                 17.042
No. Observations:                  41   AIC:                            -30.08
Df Residuals:                      39   BIC:                            -26.66
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.1382      0.175     35.124      0.000       5.785       6.492
x1             0.0804      0.002     37.224      0.000       0.076       0.085
==============================================================================
Omnibus:                        0.876   Durbin-Watson:                   0.041
Prob(Omnibus):                  0.645   Jarque-Bera (JB):                0.929
Skew:                           0.241   Prob(JB):                        0.629
Kurtosis:                       2.442   Cond. No.                         553.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Based on adjusted R-squared and p-value we could make a conlusion that the model fits the data really well. The statsmodels table gives the values for $a$ and $w$ under coef (in the middle):

  • The value $const$ is the value for $a$ in our Linear Regression: 10.9646
  • The value $x1$ is the value for $w$ in our Linear Regression: 0.0804

Thus, the linear regression function is $$y = 6.1382 + 0.0804*t$$

Going back to the original formula for exponential growth, we get the actual formula for the coronavirus pandemic: $$n_t = 463 *1.08^t$$

In [15]:
def linear_predictions1(t):
    return np.exp(6.1382) * (np.exp(0.0804) ** t)

#d = d.reset_index()
d['index'] = d.index
d['case_predictions'] = d['index'].apply(linear_predictions1)

df_melt = d.melt(id_vars='index', value_vars=['cases', 'case_predictions'])


fig = px.line(df_melt, x='index' , y='value' , color='variable')

fig.update_layout(title_text="Coronavirus Cases and Case Projections Assuming Exponential Growth",
                  title_font_size=20,
                 xaxis_title="Days since 2020-01-01",
                yaxis_title='''Total Coronavirus Cases''')
fig.show()

Similarly obtain the actual formula for the CoVID-19 deaths data: $$n_t = 6.21 *1.1^t$$

In [16]:
import statsmodels.api as sm
X = d.index
X = sm.add_constant(X)
y = d.log_deaths
mod = sm.OLS(y[60:], X[60:])
res = mod.fit()
print(res.summary())
def linear_predictions(t):
    return np.exp(1.8264) * (np.exp(0.0947) ** t)
d['death_predictions'] = d['index'].apply(linear_predictions)
df_melt = d.melt(id_vars='index', value_vars=['deaths', 'death_predictions'])


fig = px.line(df_melt, x='index' , y='value' , color='variable')

fig.update_layout(title_text="Coronavirus Deaths and Death Projections Assuming Exponential Growth",
                  title_font_size=20,
                 xaxis_title="Days since 2020-01-01",
                yaxis_title='''Total Coronavirus Deaths''')
fig.show()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             log_deaths   R-squared:                       0.974
Model:                            OLS   Adj. R-squared:                  0.973
Method:                 Least Squares   F-statistic:                     1451.
Date:                Mon, 13 Apr 2020   Prob (F-statistic):           1.82e-32
Time:                        18:59:21   Log-Likelihood:                 11.297
No. Observations:                  41   AIC:                            -18.59
Df Residuals:                      39   BIC:                            -15.17
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.8264      0.201      9.084      0.000       1.420       2.233
x1             0.0947      0.002     38.089      0.000       0.090       0.100
==============================================================================
Omnibus:                        1.574   Durbin-Watson:                   0.041
Prob(Omnibus):                  0.455   Jarque-Bera (JB):                1.474
Skew:                           0.432   Prob(JB):                        0.479
Kurtosis:                       2.660   Cond. No.                         553.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Despite the fact, that exponential curve fitted observed data almost perfectly, Exponential based infected population projection are subject to high risk due to daily changing situation. The repeated projection at 3-7 days are necessary to control the curve.

Only in China the first wave of pandemic has reached to its end. Rest of the countries are still active areas and time of its peak, infected population size will depend on effectiveness of community supported national action policies such that social distancing, health system capacity, access to food and medicine and others.

Making Predictions

Using the function that we have estimated using the Exponential Growth curve, let's predict infected population size in a week after the last day of the dataset 2020-04-09:

In [17]:
x = [i for i in range(108)]
y = linear_predictions1(x)
y1 = linear_predictions(x)

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Total Cases", "Total Deaths"))


fig.add_trace(go.Scatter(x=x, y=y), row=1, col=1)

fig.add_trace(go.Scatter(x=x, y=y1), row=1, col=2)

fig.update_layout(title_text="CoVID-19 Projection and 1 Week Prediction Assuming Exponential Growth",
                  title_font_size=20,
                 xaxis_title="Days since 2020-01-01",
                yaxis_title='''Total Cases and Deaths''')

fig.update(layout_showlegend=False)
#fig.update_xaxes(tickangle=90)

fig.show()

Predicting the growth as polynomial function

Now the question arise: is exponential growth describes the observed data best of all?

The obvious change point in infected population is 22nd of February - the date when the first CoVid-19 was registered in Italy. The pandemic curve starting from this stage may be estimated by polynomial curve of the 3rd order even better: $$n_t = a_0 + a_1*t+a_2*t^2 + a_3*t^3$$

In [18]:
d1 = total_cum[53:].reset_index().reset_index()
d1.head()
Out[18]:
index dateRep cases deaths log_cases log_deaths
0 0 2020-02-22 77804 2359 11.261961 7.766417
1 1 2020-02-23 78812 2463 11.274833 7.809541
2 2 2020-02-24 79339 2619 11.281498 7.870930
3 3 2020-02-25 80132 2698 11.291443 7.900637
4 4 2020-02-26 80995 2762 11.302155 7.924072
In [19]:
d1['t'] = d1['index']
d1['t^2'] = d1['t'].apply(lambda x: x**2)
d1['t^3'] = d1['t'].apply(lambda x: x**3)
d1['t^4'] = d1['t'].apply(lambda x: x**4)

X = d1[['t','t^2','t^3']]
X = sm.add_constant(X)
y = d1.cases
mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  cases   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                     7764.
Date:                Mon, 13 Apr 2020   Prob (F-statistic):           6.17e-60
Time:                        19:00:18   Log-Likelihood:                -538.56
No. Observations:                  48   AIC:                             1085.
Df Residuals:                      44   BIC:                             1093.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       8.359e+04   1.01e+04      8.285      0.000    6.33e+04    1.04e+05
t           1861.7047   1878.680      0.991      0.327   -1924.526    5647.936
t^2         -412.8750     93.446     -4.418      0.000    -601.203    -224.547
t^3           21.9020      1.306     16.767      0.000      19.269      24.534
==============================================================================
Omnibus:                        3.438   Durbin-Watson:                   0.175
Prob(Omnibus):                  0.179   Jarque-Bera (JB):                2.372
Skew:                          -0.482   Prob(JB):                        0.305
Kurtosis:                       3.508   Cond. No.                     1.51e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [21]:
def polinomial_predictions(t):
    return np.asarray(83590) + np.asarray(1861.7047)*t -np.asarray(412.8750)*np.power(t,2) + np.asarray(21.9020)*np.power(t,3)

d1['case_predictions'] = d1['t'].apply(polinomial_predictions)
df_melt = d1.melt(id_vars='index', value_vars=['cases', 'case_predictions'])


fig = px.line(df_melt, x='index' , y='value' , color='variable')

fig.update_layout(title_text="Coronavirus Cases and Projection Assuming Polynomial Growth",
                  title_font_size=20,
                 xaxis_title="Days since 2020-02-22",
                yaxis_title='''Total Coronavirus Cases''')
fig.show()

As we have assumed, polynomial growth fits observed data starting 2020-02-22 much better (R-squared = 0.998). Similarly obtain the formula for the CoVID-19 deaths data assuming polynomial growth and make predictions:

In [22]:
import statsmodels.api as sm
X = d1[['t','t^2','t^3']]
X = sm.add_constant(X)
y = d1.deaths
mod = sm.OLS(y, X)
res = mod.fit()
#print(res.summary())
def polinomial_predictions1(t):
    return np.asarray(1165.3676) + np.asarray(729.9956)*t -np.asarray(67.7441)*np.power(t,2) + np.asarray(1.9513)*np.power(t,3)

d1['death_predictions'] = d1['t'].apply(polinomial_predictions1)
df_melt = d1.melt(id_vars='index', value_vars=['deaths', 'death_predictions'])


fig = px.line(df_melt, x='index' , y='value' , color='variable')

fig.update_layout(title_text="Coronavirus Deaths and Projection Assuming Polynomial Growth",
                  title_font_size=20,
                 xaxis_title="Days since 2020-02-22",
                yaxis_title='''Total Coronavirus Deaths''')
fig.show()

How is the implementation of existing strategies affecting the rates of COVID-19 infection?

The Covid-19 virus has affacted lives of everyone. The governments are implementing strict measure of quarantine in order to stop the spread of the virus. In this notebook, I would like to summarize the statistics of the virus progress around the world and to see, whether the actions of social-distancing and isolation are impacting the speed and fatality of contracting. All of the hypothesis will be build around the data availible. However, we shall not forget, that this data might be deviated from the truth and thus, all conclusions derived from this data may not be 100% accurate.

World's COVID-19 Statistics

The first data source is a worldwide-aggregated statistics of the number of confirmed Coronavirus cases, number of recovered patients, deathes and increase rate of the virus spread. The datasource is updated daily. The first record was created on the 22nd of January 2020

In [23]:
world_stat = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/master/data/worldwide-aggregated.csv')
world_stat.Date = pd.to_datetime(world_stat.Date)
world_stat.head()
Out[23]:
Date Confirmed Recovered Deaths Increase rate
0 2020-01-22 555 28 17 NaN
1 2020-01-23 654 30 18 17.837838
2 2020-01-24 941 36 26 43.883792
3 2020-01-25 1434 39 42 52.391073
4 2020-01-26 2118 52 56 47.698745
In [24]:
fig = px.line(world_stat)

fig.add_scatter(x=world_stat['Date'], y=world_stat['Confirmed'], line=dict(color='blue', width=2), name='Confirmed cases')
fig.add_scatter(x=world_stat['Date'], y=world_stat['Deaths'], line=dict(color='red', width=4), name='Deaths')
fig.add_scatter(x=world_stat['Date'], y=world_stat['Recovered'], line=dict(color='green', width=3), name='Recovered')

fig.update_layout(title_text="World's Covid-19 Statistics",
                  title_font_size=20,
                 xaxis_title="Date",
                yaxis_title="Number of citizens")
fig.show()

The plot above illustrates the exponential growth of confirmed coved-19 cases over the course of 4 month. The number of recovered is substainsially higher, than of the number of patients, who passed away. However, there is no decrease (that might have been expected due to governmnet measures) in the virus spread. Maybe, this is the result of higher testing numbers and/or incubation period of the virus. In the future work (if the data availible), this can be explored further.

In [25]:
fig = go.Figure(data=[
        go.Bar(name='Recovered', x=world_stat['Date'], y=world_stat['Recovered']),
        go.Bar(name='Deaths', x=world_stat['Date'], y=world_stat['Deaths'])])

fig.update_layout(barmode='group',
                 title_text="World Covid-19 Deaths vs Recovery",
                  title_font_size=20,
                 xaxis_title="Date",
                yaxis_title="Number of citizens")
fig.show()

This plot again, illustarates the number of recovered vs the number of deathes, where number of recovered grows by a steeper curve. The number of documented from covid-19 deathes on the 6th of Aprile 2020 is over 74k. Comparing to the pandemics, it is on the same level as 2009 swine flu pandemic by the number of human lives taken.

In [26]:
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Deaths to New Cases Ratio", "Recovered to New Cases Ratio"))


fig.add_scatter(x=world_stat['Date'], y=world_stat['Deaths']/world_stat['Confirmed'], line=dict(color='red', width=2), row=1, col=1)

fig.add_scatter(x=world_stat['Date'], y=world_stat['Recovered']/world_stat['Confirmed'], line=dict(color='blue', width=2), row=1, col=2)


fig.show()

The figure_1 shows the ration of deathes to the confirmed cases over time. And figure_2 shows the ratio of recoveries to total registered cases respectively. The death ratio was at its minimum on the 5th of February, with only 2.04% and maxed out on the 6 of April with 5.55% and appeares to be growing. This might be due to lack of hospital beds and ITUs, as the awerness is raised. The recovery ratio peaked on the 8th of March with 55.2%. Since the middle of March the was a recrease in recovery rate, where since 28th of March is plateaus at approximately 20%.

- The mean avarage values are:

- mortality: 3.58%
- recovery: 24.9 %
In [27]:
(world_stat['Recovered']/world_stat['Confirmed']).mean()*100
Out[27]:
24.911277955820395
In [28]:
(world_stat['Deaths']/world_stat['Confirmed']).mean()
Out[28]:
0.03577511098579713

Aggregated Statistics by Country

The second datasource comes from kaggle competition (https://www.kaggle.com/c/covid19-global-forecasting-week-3)

This dataframe contains the information about each country's statistic separately, dated from 22nd of January to 5th of April 2020

During this part of descriptive analysis, we can take a closer look at each country's state.

In [30]:
agg_stats = pd.read_csv('covid19-global-forecasting-week-3/train.csv')
agg_stats.Date = pd.to_datetime(agg_stats.Date)
agg_stats.head()
Out[30]:
Id Province_State Country_Region Date ConfirmedCases Fatalities
0 1 NaN Afghanistan 2020-01-22 0.0 0.0
1 2 NaN Afghanistan 2020-01-23 0.0 0.0
2 3 NaN Afghanistan 2020-01-24 0.0 0.0
3 4 NaN Afghanistan 2020-01-25 0.0 0.0
4 5 NaN Afghanistan 2020-01-26 0.0 0.0
In [31]:
grouped = agg_stats.groupby('Country_Region')[['ConfirmedCases', 'Fatalities', 'Date']].agg(max)
grouped.reset_index(inplace=True)
In [32]:
iso_df = pd.read_csv('plotly_countries_and_codes.csv')
iso_dict = pd.Series(iso_df.CODE.values,index=iso_df.COUNTRY).to_dict()
iso_dict['US'] = 'USA'
grouped['iso'] = grouped['Country_Region'].map(iso_dict)

fig = px.choropleth(grouped, locations="iso",
                    color="ConfirmedCases", 
                    hover_name="Country_Region", 
                    color_continuous_scale=px.colors.sequential.Plasma)

fig.update_layout(title_text=f"Countries vs Confirmed Cases", title_font_size=20)
                
fig.show()

This map shows the number of convirmed cases by the intensity of color. US, Italy and Spain are the ones of the most affected.

In [33]:
fig = px.choropleth(grouped, locations="iso",
                    color="Fatalities", 
                    hover_name="Country_Region", 
                    color_continuous_scale=px.colors.sequential.Plasma)

fig.update_layout(title_text=f"Countries: Fatality rates", title_font_size=20)
fig.show()

Even though the number of infected in the US in substantially higher, as of April 5th, Italy has lost the highest number of citizens - 15 887

The barcharts below summarize the information about top20 and bottom20 countries by the number of confirmed cases and deaths

In [34]:
fig = make_subplots(
    rows=1, cols=2, vertical_spacing=0.6,
    subplot_titles=("20 Countries with the most confirmed cases", "20 countries with the least confirmed cases"))

grouped = grouped.sort_values(by='ConfirmedCases', ascending=False)
grouped_top_20 = grouped.iloc[:20]
grouped_bot_20 = grouped.iloc[-20:].sort_values(by='ConfirmedCases')
fig.add_trace(go.Bar(x=grouped_top_20['Country_Region'], y=grouped_top_20['ConfirmedCases'], name='Confirmed_high'), row=1, col=1)
fig.add_trace(go.Bar(x=grouped_bot_20['Country_Region'], y=grouped_bot_20['ConfirmedCases'], name='Confirmed_low'), row=1, col=2)
fig.update_xaxes(tickangle=65)
fig.update_layout(title_font_size=20)
fig.show()
In [35]:
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('20 Countries with the highest Death Rate', '20 countries with the lowest (or == 1) death rate'))

grouped = grouped.sort_values(by='Fatalities', ascending=False)
grouped_top_20 = grouped.iloc[:20]

grouped_bot_20 = grouped[grouped.Fatalities>0].iloc[-20:].sort_values(by='Fatalities')
fig.add_trace(go.Bar(x=grouped_top_20['Country_Region'], y=grouped_top_20['Fatalities'], name='Mortality_high'), row=1, col=1)
fig.add_trace(go.Bar(x=grouped_bot_20['Country_Region'], y=grouped_bot_20['Fatalities'], name='Mortality_low'), row=1, col=2)
fig.update_xaxes(tickangle=65)
fig.update_layout(title=f'{agg_stats.Date.max()}', title_font_size=20)
fig.show()

List of Countries, where no registed deaths from covid-19 occured

In [36]:
grouped[grouped.Fatalities==0][['Country_Region']].reset_index(drop=True)
Out[36]:
Country_Region
0 Holy See
1 Mozambique
2 Burundi
3 Saint Vincent and the Grenadines
4 Central African Republic
5 Somalia
6 Sierra Leone
7 Nepal
8 Bhutan
9 Papua New Guinea
10 Chad
11 Eswatini
12 Eritrea
13 Saint Kitts and Nevis
14 Guinea-Bissau
15 Malta
16 Vietnam
17 Guinea
18 Cambodia
19 Rwanda
20 Djibouti
21 Madagascar
22 Uganda
23 Maldives
24 Seychelles
25 Equatorial Guinea
26 Namibia
27 Dominica
28 Mongolia
29 Fiji
30 Saint Lucia
31 Laos
32 Grenada
33 Timor-Leste
In [37]:
fig = go.Figure(data=[
        go.Bar(name='Recovered', x=grouped['Country_Region'], y=grouped['ConfirmedCases'])
        ])

fig.update_layout(barmode='group',
                 title_text=f"Countries vs Confirmed Cases Total",
                  title_font_size=20,
                 xaxis_title="Country",
                yaxis_title="Confirmed Cases Total", xaxis_tickangle=-45)

fig.show()

Government Measures against spread of virus

Now, when we have built an understanding around general statistics around coved-19, the time has come to look through the measures, that the governments are undertaking and try to infer their impact on the situation.

The dataset used in this part contains info about government actions from different souces. The last datapoint added is dated March 27th

In this section, I would like to answer the following questions:

  • What are the most common measures taken?
  • When did the majority of governments announced quarantine?
  • Which countries reacted to late?
  • Is there a positive tendency in mortality after the measures were undertaken?
In [38]:
gov_measures = pd.read_csv('acaps-covid-19-government-measures-dataset.csv')
na_date = gov_measures[gov_measures.date_implemented.isna()]
gov_measures = gov_measures[~gov_measures.date_implemented.isna()]

lst = list(gov_measures[gov_measures.date_implemented.map(len) == 5]['date_implemented'])
tbd = gov_measures[gov_measures.date_implemented.isin(lst)]
gov_measures =  gov_measures[~gov_measures.date_implemented.isin(lst)]
gov_measures['date_implemented'] = pd.to_datetime(gov_measures['date_implemented'])

gov_measures.category = gov_measures.category.replace({'Movement Restriction':'Movement restrictions',
                                                       'Movement Restrictions':'Movement restrictions',
                                                      'Social and Economic Measures': 'Social and economic measures',
                                                      'Social Distancing': 'Social distancing'})

# gov_measures[['country', 'iso', 'measure', 'date_implemented', 'category']].sort_values(by='date_implemented')
In [39]:
measures = pd.DataFrame(gov_measures.measure.value_counts()).reset_index()
measures.columns = ['measure', 'count']

fig = go.Figure(data=[
        go.Bar(name='Government Measures', x=measures['measure'], y=measures['count'])
        ])

fig.update_layout(barmode='group',
                 title_text="Most common Government Measures to stop spread of Covid-19",
                  title_font_size=20,
                yaxis_title="Number of Governments to undertake", xaxis_tickangle=-40)
fig.show()

Intuitively, we could have guessed, that the most popular government measures are the quarantine policy and travel restrictions. All of these restrictions can be summarized into 4 categories:Public Health, Social & Economic, Social distaning and Movement Restrictions. More detailed breakdown on the graph below.

In [42]:
fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=(list(gov_measures.category.unique())))


m = list(gov_measures.category.unique())
lay = {0:(1,1),
      1:(1,2),
      2:(1,3),
      3:(2,1),
      4:(2,2),
      5:(2,3)}
for i, category in enumerate(m):
    temp = pd.DataFrame(gov_measures[gov_measures.category==category]['measure'].value_counts()).reset_index()

    fig.append_trace(go.Bar(x=temp['index'], y=temp['measure']), row=lay[i][0], col=lay[i][1])
    
fig.update_layout(title_text="Measures by category",
                  title_font_size=30,
                 height=1900, width=1000)
    
fig.update_xaxes(tickangle=65)
fig.update(layout_showlegend=False)

fig.show()

The next chart illustates the cumulative number of policies effective through the course of January-April 2020. The most "awakening" period started after the 1st of March

In [43]:
implementations = pd.DataFrame(gov_measures.date_implemented.value_counts()).reset_index()
implementations.columns = ['date', 'count']
implementations = implementations.sort_values(by='date')
implementations['running_count'] = implementations[['count']].cumsum()

fig = px.line(implementations)
fig.add_scatter(x=implementations['date'], y=implementations['running_count'],mode='lines+markers', name="measures undertaken")

fig.update_layout(title_text="# Government Restrictions Effeactive around the World",
                  title_font_size=20,
                 xaxis_title="Date",
                yaxis_title="Number of restriction")

fig.show()

The chart of number of implemented restrictions reminds of the exponensial growth of confirmed cases around the world

In [44]:
g_m = gov_measures.groupby(['country', 'measure'])['date_implemented'].agg('max').reset_index()
g_m_quarantine = g_m[(g_m.measure == 'Introduction of quarantine policies')].sort_values(by='date_implemented')

fig = px.scatter(x=g_m_quarantine['country'],
                 y=g_m_quarantine['date_implemented'])
fig.update_layout(xaxis_type='category',
                  title_text='Timeline of countries starting Quarantine',
                 xaxis_title="Country",
                yaxis_title="Date effective")
fig.update_xaxes(tickangle=65, title_font_size=20)
fig.show()

This scatterplot shows, which countries were the first and the last ones to react to the virus outbreak. We will take a closer look at the "first" countries - Afganistan and Germany, who introduced the first policies on the 14th and 16th of February respectively. As well as some the last ones - Cuba and Georgia, who reacted after the 20th of March.

In [45]:
def measures_vs_country(c_of_int, i):
    country_ = pd.merge(agg_stats[agg_stats.Country_Region==c_of_int][['Country_Region', 'Date', 'ConfirmedCases','Fatalities']],
            gov_measures[gov_measures.country==c_of_int][['country', 'measure', 'date_implemented', 'category']],
            left_on=['Date', 'Country_Region'], right_on=['date_implemented', 'country'], how='left')[['Country_Region', 'Date', 'ConfirmedCases', 'Fatalities', 'measure', 'category']]

#     country_ = country_.sort_values(by=['Date'], ascending=False)
#     country_.drop_duplicates(['measure'], keep='first', inplace=True)
    fig = px.line(country_)
    fig.add_scatter(x=country_['Date'], y=country_['ConfirmedCases'],mode='lines', name="ConfirmedCases")
    fig.add_scatter(x=country_['Date'], y=country_['Fatalities'],mode='lines', name="Fatalities")

    work = gov_measures[gov_measures.country== c_of_int].sort_values(by='date_implemented')
    work = work[~(work.duplicated('measure', keep='last'))]

    fig.add_scatter(x=work['date_implemented'], y=[_*i for _ in range(1, len(work)+1)],mode='markers+text', text=work['measure'], name="Measures", hovertext=work['measure'])

    fig.update_layout(title=f'Number of Cases in {c_of_int} and dates of measures undertaken', title_font_size=20)
    fig.show()
In [46]:
measures_vs_country('Afghanistan', 20)
In [47]:
measures_vs_country('Germany', 5000)

Both countries - Germany and Afghanistan declared State of the Emergency, when there were 0 confirmed cases of covid. The populations of countries in 35.5 M - Afghanistan and 83 M Germany. There is no reliable info about number of tests conducted in Afghanistan. Germany - (based on: https://www.worldometers.info/coronavirus/) 918,460 test as of April 7th.

In Afghanistan the first case registered on 24th of February, but the total quarantine was introduced sooner - on the 12th of February.

Germany quarantine policies was introduced on the 14th of February. General Lockdown was anounced on the 21st of March when number of affected reaced 22K and number of Fatalities - 84.

In [48]:
measures_vs_country('Cuba', 20)

It is clear from the graph, that Cuba reacted to late. By the time citizens were quarantined (March 24th), there were already 48 infected (considering the population of 11M). The growth of confirmed cases is a steep exponential curve.

In [49]:
measures_vs_country('Georgia', 20)

Similarly, Georgia started taking action only after the official number of cases was above 30.

Next, we will take a look at which countries managed to react to the virus before its spread

In [50]:
into_to_q = pd.merge(agg_stats[['Country_Region', 'Date', 'ConfirmedCases','Fatalities']],
        gov_measures[['country', 'measure', 'date_implemented', 'category']],
        left_on=['Date', 'Country_Region'], right_on=['date_implemented', 'country'], how='left')[['Country_Region', 'Date', 'ConfirmedCases', 'Fatalities', 'measure', 'category']]

t = into_to_q.groupby(['Country_Region', 'measure'])['Date'].agg(max).reset_index()
t = t[t.measure == 'Introduction of quarantine policies']
In [51]:
def f(date, country):
    try:
        return True if t[t.Country_Region==country]['Date'].values[0] <= date else False
    except IndexError:
        return 'unknown'
In [52]:
into_to_q['quarantine_affective'] = into_to_q.apply(lambda x: f(x['Date'], x['Country_Region']), axis=1)
# discarding countries, that we have no ongo about 
into_to_q = into_to_q[~(into_to_q['quarantine_affective']=='unknown')]

List of countries, that introduced quarantine before the first confirmed case

In [53]:
con_before_quar = list(set(into_to_q[(into_to_q.quarantine_affective==True) & (into_to_q.ConfirmedCases == 0)]['Country_Region']))
pd.DataFrame(con_before_quar, columns=['country'])
Out[53]:
country
0 Botswana
1 Saint Lucia
2 Haiti
3 Netherlands
4 Chad
5 Afghanistan
6 Burundi
7 Canada
8 Sierra Leone
9 Turkey
10 France
11 Mauritius
12 El Salvador
13 Liberia
14 Guatemala
15 Kyrgyzstan
16 Papua New Guinea
17 Grenada
18 Mali
19 Syria
20 Belize
21 Angola
22 Uganda
23 Uruguay

Next, I'd like to comapare the mortality per 1 milion of the above countries, as well as countries that began quarantine later. I will load the population dataset from https://www.kaggle.com/tanuprabhu/population-by-country-2020

there is availible information about the med. age of the citizens, which we might use later.

In [54]:
population = pd.read_csv('population_by_country_2020.csv')[['Country (or dependency)', 'Population (2020)', 'Med. Age']]
In [55]:
stats_population = pd.merge(grouped, population, left_on=['Country_Region'], right_on=['Country (or dependency)'])
stats_population['death_per_1M'] = (stats_population['Fatalities']/stats_population['Population (2020)'])*1000000
stats_population['infected_per_1M'] = (stats_population['ConfirmedCases']/stats_population['Population (2020)'])*1000000

Based on the information, availible to us, we can compare mean values of deaths per 1M and mean value of infected per 1M between sets of countries who introduced policies before the first case, and the ones after.

In [56]:
print("Mean number of deaths per 1M in countries starting quarantin early: ", stats_population[stats_population.Country_Region.isin(con_before_quar)]['death_per_1M'].mean())
print("Mean number of cases per 1M in countries starting quarantin early: ", stats_population[stats_population.Country_Region.isin(con_before_quar)]['infected_per_1M'].mean())

print()
print("Mean number of deaths per 1M in countries starting quarantin late: ", stats_population[(stats_population.Country_Region.isin(into_to_q.Country_Region)) & (~stats_population.Country_Region.isin(con_before_quar))]['death_per_1M'].mean())
print("Mean number of cases per 1M in countries starting quarantin late: ", stats_population[(stats_population.Country_Region.isin(into_to_q.Country_Region)) & (~stats_population.Country_Region.isin(con_before_quar))]['infected_per_1M'].mean())
Mean number of deaths per 1M in countries starting quarantin early:  12.783801598972957
Mean number of cases per 1M in countries starting quarantin early:  169.82576720535494

Mean number of deaths per 1M in countries starting quarantin late:  17.20357792484505
Mean number of cases per 1M in countries starting quarantin late:  328.6351236307837

How does the median age of the citizens affect mortality?

In [57]:
stats_population = stats_population[~(stats_population['Med. Age']=='N.A.')]
stats_population['Med. Age']=stats_population['Med. Age'].map(int)
In [58]:
age_dist = stats_population.groupby('Med. Age').agg({'Country_Region':lambda x: list(x), 'death_per_1M':'mean'}).reset_index()
In [59]:
fig = go.Figure(data=[
        go.Bar(name='Avarage Number of deaths per 1M based on median age of citizens', x=age_dist['Med. Age'], y=age_dist['death_per_1M'], hoverlabel=None, text=age_dist['Country_Region'])
        ])

fig.update_layout(xaxis_type='category',
                  barmode='group',
                 title_text="Avarage Number of deaths per 1M based on median age of citizens",
                  title_font_size=20,
                yaxis_title="Median Age", xaxis_tickangle=-40)
fig.show()

Based on this plot, the conclusion can be made, that countries with older population have greater number of deathes. Thus, older people are at higher risk of dying due to coronavirus.

The countries and their median age distribution can be seen on the plot below.

In [60]:
stats_population = stats_population.sort_values(by='Med. Age', ascending=False)
fig = go.Figure(data=[
        go.Bar(name='Median age of citizens', x=stats_population['Country_Region'], y=stats_population['Med. Age'], hoverlabel=None)
        ])

fig.update_layout(xaxis_type='category',
                  barmode='group',
                 title_text="Median age of citizens",
                  title_font_size=20,
                yaxis_title="Median Age", xaxis_tickangle=-40)
fig.show()

Ukraine: closer look

I am Ukrainian, and I am curios about the situation in my homeland. The first case was documented on the 3rd of March. School closed on the 9th, national quarantine started on the 11th (when number was still 1). This 8 day delay in reaction did not help to slow down the exponential growth of infected.

In [61]:
measures_vs_country('Ukraine', 100)
In [62]:
print("Number of confirmed cases per 1M in Ukraine: ", stats_population[stats_population.Country_Region=='Ukraine']['infected_per_1M'].values[0])
print()
print("Number of deaths per 1M in Ukraine: ", stats_population[stats_population.Country_Region=='Ukraine']['death_per_1M'].values[0])
print()
print('Mortality in Ukraine: ', round(stats_population[stats_population.Country_Region=='Ukraine']['Fatalities'].values[0]/stats_population[stats_population.Country_Region=='Ukraine']['ConfirmedCases'].values[0]*100, 2),' %')
Number of confirmed cases per 1M in Ukraine:  33.42955037803517

Number of deaths per 1M in Ukraine:  1.0289533290092903

Mortality in Ukraine:  3.08  %

Cabinet of ministers of Ukraine provides the data about hospitals, doctors' availibility and numbers of artificial breathing ventilators (https://covid19.gov.ua/vidkryti-dani).

Based on the info about number of new patients, what percentage of people will be able to recieve medical aid?

In [82]:
doctors = pd.read_csv('covid19_likari.csv', sep=',')
doctors['Дата'] = pd.to_datetime(doctors['Дата'])
doctors['Внесено'] = pd.to_datetime(doctors['Внесено']).dt.date
doctors['Внесено'] = pd.to_datetime(doctors['Внесено'])
doctors = doctors[(doctors['Внесено'] == '2020-08-04')]
doctors.head()
Out[82]:
id Назва Адреса Область Дата Пользователь txt Внесено Всі лікарі, які можуть бути задіяні з COVID-19 (усі спеціальності) Лікарі, які вже працюють з пацієнтами з COVID-19 ... ПЦР використано Кількість тампонів та стерильних пробірок з транспортним середовищем залишок Кількість тампонів та стерильних пробірок з транспортним середовищем використано за добу З них анестезіологів З них інфекціоністів Боксовані палати Зайнято боксованих Кількість кисневих генераторів до ШВЛ Рентген апарат Наявність системи централізованого постачання кисню
1237 42799 КНП "Міська лікарня №1" Чернівецької міської ради вул. Червоноармійська, 226 Чернівецька 2020-08-04 505 1110972 2020-08-04 66 24 ... 0 544 50 9 2 50 25 1 0 0
1239 42793 Комунальний заклад "Міська багатопрофільна клі... вул. Ближня, буд.31 Дніпропетровська 2020-08-04 304 1280527 2020-08-04 21 0 ... 0 0 0 18 1 1 0 4 1 1
1244 42761 КОМУНАЛЬНЕ НЕКОМЕРЦІЙНЕ ПІДПРИЄМСТВО "ТРОСТЯНЕ... м.Тростянець, вул.Нескучанська, буд.7 Сумська 2020-08-04 446 1981508 2020-08-04 15 0 ... 0 194 4 4 2 16 0 0 1 1
1247 42833 Комунальна організація (установа, заклад) "Шос... вул. Щедріна, буд.1 Сумська 2020-08-04 441 1981514 2020-08-04 22 3 ... 0 330 20 6 3 21 5 0 1 1
1249 42692 КНП "Чорноморська лікарня" Чорноморської міськ... Одеська обл., м. Чорноморськ, вул. В. Шума, 4 Одеська 2020-07-04 423 1982212 2020-08-04 60 1 ... 0 16 0 8 3 8 3 1 1 1

5 rows × 78 columns

In [84]:
fig = make_subplots(
    rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]],
    subplot_titles=("Availibility of doctors", 'Availibility of nurses'))
labels = ['Availible','Already working with covid-19 patients']

total = doctors['Всі лікарі, які можуть бути задіяні з COVID-19 (усі спеціальності)'].sum()
involved_with_patients = doctors['Лікарі, які вже працюють з пацієнтами з COVID-19'].sum()
availible = total-involved_with_patients
values = [availible, involved_with_patients]
fig.append_trace(go.Pie(labels=labels, values=values, hole=.5), row=1, col=1)

total = doctors['Всі медсестри'].sum()
involved_with_patients = doctors['Медсестри, які вже працюють з пацієнтами з COVID-19'].sum()
availible = total-involved_with_patients
values = [availible, involved_with_patients]
fig.append_trace(go.Pie(labels=labels, values=values, hole=.5), row=1, col=2)
 
    
fig.update_layout(title='Medical staff (Ukraine, 08.04.2020)', title_font_size=20)
fig.show()

Based on the most recent availible data (8th of April), 730 doctors and 1287 nurses are already occupied with covid-19 patients. As of 8th of April there were 1668 confirmed cases, 35 healed patients and 52 letal cases. This is 1518 patients, where only part of them requires hospitalization. In case of more infected - Ukraine has enough professionals. Next, we will take a look at availibility of hospital spaces and ICUs

In [85]:
labels = ['High Class Ventilators', 'Mid Class Ventilators', 'Kids Ventilators']

high_class = doctors['Справних ШВЛ високого класу'].sum() 
mid_class = doctors['Справних ШВЛ середнього класу'].sum() 
# portable = doctors['Справних портативних (транспортних) ШВЛ '].sum() 
kids = doctors['Справних ШВЛ дитячих (0-6 років)'].sum() 

outer_values = [high_class,mid_class,kids]

high_class_used = doctors['ШВЛ високого класу підключено до пацієнтів'].sum() 
high_class_avbl = high_class - high_class_used
mid_class_used = doctors['ШВЛ середнього класу підключено до пацієнтів'].sum()  
mid_class_avlb = mid_class - mid_class_used
kids_used = doctors['Пацієнтів підключених до ШВЛ дитячих'].sum()  
kids_avlb = kids - kids_used

inner_values = [high_class_used, high_class_avbl, mid_class_used, mid_class_avlb,
                 kids_used, kids_avlb]
labels2 = ['used_', 'availible_', 'used', 'availible', '_used', '_availible']
common_props = dict(labels=labels,
                    values=values,)

trace1 = go.Pie(
    hole=0.5,
    sort=False,
    direction='clockwise',
    domain={'x': [0.15, 0.85], 'y': [0.15, 0.85]},
    values=inner_values,
    textinfo='label',
    textposition='inside',
    labels=labels2,
    marker={'line': {'color': 'white', 'width': 1}}
)

trace2 = go.Pie(
    hole=0.7,
    sort=False,
    direction='clockwise',
    values=outer_values,
    labels=labels,
    textinfo='label',
    textposition='outside',
    marker={'colors': ['green', 'red', 'blue'],
            'line': {'color': 'white', 'width': 1}}
)

fig = go.FigureWidget(data=[trace1, trace2])
fig.update_layout(title='Availibility of Artificial Breathing Ventilators (Ukraine, 08.04.2020)', title_font_size=20,
    showlegend=False)
fig.show()
In [86]:
labels = ['Availible Hospital Beds','Occupied by covid-19 patients', 'Occupied by other patients']

beds_total = doctors['Ліжок загалом'].sum() 
occupied = doctors['Зайнято загалом ліжок'].sum() 
occupied_cv = doctors['Зайнято хворими з COVID-19'].sum() 
occupied_others = occupied - occupied_cv
avlb_beds = beds_total - occupied

values = [avlb_beds, occupied_cv, occupied_others]
fig= go.Figure(data=[(go.Pie(labels=labels, values=values, hole=.5))])


fig.update_layout(title='Hospital beds availibility (Ukraine, 08.04.2020)', title_font_size=20)
fig.show()

Concluding from the data availible - Ukraine has the potential to fight with the virus and to save lives of the citizens. Out of 1668 (8.04.20) confirmed cases only 748 are currently hospitalized, it is 44%. With over 39K availible hospital places left, there should be no shortage of spots. However, for the ICUs - we don't know exactly whether they are in use by covid or other patients, so it is hard to estimate their future availibility.

Conclusions

The most important conclusions from our analysis are the following:

- Countries that introduced quarantine policies before the first case of coronavirus was registered on their territory generally tend to have a lower number of infected than the countries, who were a little bit slower to react. For example, the average number of infected per 1M of the population in "early birds" countries is 150. When in contrast, the "late" countries have 345 infected/1M. That is a 2.3-time difference! We cannot stress enough how important it is to follow the quarantine and social distancing - it makes a significant impact!

- The higher the median age of the country's population - the higher the mortality rate from COVID-19. It supports the claim that the virus is more dangerous to older people, thus the higher mortality. The countries with a median population age over 40, suffer greater losses than the younger populations.

- Analyzing the data provided by the Cabinet of Ministers of Ukraine about hospital beds, medical staff, and ICUs availability, we have found out, that Ukrainian hospitals are well-equipped to treat the patients. As of the 8th of April, 730 doctors and 1287 nurses are already occupied with COVID-19 patients. It is about 7% of all medical staff, qualified to work with infected patients. Over 90% of artificial breathing ventilators are available.

- Everyone wonders - what is the future, can we predict how many people will get infected, when will quarantine end, when the virus will stop spreading? However, performing any kind of predictive analysis would involve a very vague hypothesis that the situation is static. When in reality, it is anything but static. There exist multiple factors that prevent us from building a robust machine learning model. For example, it is difficult to predict how well do the citizens comply with the quarantine policies. It is a challenge to estimate the spread rate of virus between medical staff - how many doctors and nurses will there be available at time N? It is as well hard to predict the possible date of cure invention.

In [81]:
from nbconvert import HTMLExporter
import codecs
import nbformat

notebook_name = 'avenga-covid19.ipynb'
output_file_name = 'avenga-covid19.html'

exporter = HTMLExporter()
output_notebook = nbformat.read(notebook_name, as_version=4)

output, resources = exporter.from_notebook_node(output_notebook)
codecs.open(output_file_name, 'w', encoding='utf-8').write(output)

FILE = "avenga-covid19.html"

with open(FILE, 'r') as html_file:
    content = html_file.read()

# Get rid off prompts and source code
content = content.replace("div.input_area {
	display: none;","div.input_area {
	display: none;\n\tdisplay: none;")    
content = content.replace(".prompt {
	display: none;",".prompt {
	display: none;\n\tdisplay: none;")

f = open(FILE, 'w')
f.write(content)
f.close()
In [ ]: